ABSTRACT

Efficient similarity retrieval from large-scale multimodal databases
is pervasive in modern search engines and social networks. To support
queries across content modalities, the system should enable
cross-modal correlation and computation-efficient indexing. While
hashing methods have shown great potential in achieving this goal,
current attempts generally fail to learn isomorphic hash codes in a
seamless scheme; that is, they embed multiple modalities in a
continuous isomorphic space and separately threshold the embeddings into
binary codes, which incurs substantial loss of retrieval accuracy. In
this paper, we approach seamless multimodal hashing by proposing
a novel Composite Correlation Quantization (CCQ) model. Specifically,
CCQ jointly finds correlation-maximal mappings that transform
different modalities into an isomorphic latent space, and learns
composite quantizers that convert the isomorphic latent features
into compact binary codes. An optimization framework is devised
to preserve both intra-modal similarity and inter-modal correlation
by minimizing both reconstruction and quantization errors, and it
can be trained from both paired and partially paired data in
linear time. A comprehensive set of experiments clearly shows the
superior effectiveness and efficiency of CCQ against state-of-the-art
hashing methods for both unimodal and cross-modal retrieval.
1. INTRODUCTION

While big data with large volume, high dimensions, and multiple
modalities are ubiquitous in search engines and social networks,
distilling the correlation structures across heterogeneous data
modalities has attracted increasing attention. For example, an
uploaded image on Flickr is usually annotated with relevant descriptions
or tags, while a featured article on Wikipedia may contain
related images. As relevant data from different modalities may
share semantic correlations, it is desirable to support multimodal
search, which retrieves semantically relevant results of all modalities
in response to a unimodal query. Taking Flickr as an example, when
a query image is given, the system should return both relevant tags
and images. Due to the large volume and the semantic gap, effective
and efficient retrieval of multimodal data remains a challenge.

When the reference database is large-scale or the distance calculation
between a query item and a database item is costly, an efficient
solution for similarity search is hashing-based methods [22], which
perform approximate nearest neighbor (ANN) search with both
computational efficiency and acceptable accuracy.
The principle of hashing is to transform high-dimensional data into
compact binary codes and to generate similar binary codes for similar
data items. Seminal work includes Locality Sensitive Hashing
(LSH) and Spectral Hashing (SH). However, traditional
unimodal hashing methods cannot support multimodal search, as
ANN cannot be directly computed across different modalities.

Recently, several attempts have been made at multimodal
hashing, which builds correlation structures across multiple
modalities in the process of hash function learning and indexes multimodal
data in a common Hamming space. These methods generally work in a
two-step pipeline:
first, embed multiple data modalities into a continuous isomorphic
latent space by maximizing inter-modal correlations, and second,
quantize the isomorphic embeddings into binary hash codes by sign
thresholding. While showing promising performance, the two-step
pipeline has two limitations: first, conversion from real-valued
features to discrete codes may incur substantial information
loss, making the continuous latent space suboptimal for binary
coding and the binary codes suboptimal for retrieval; second,
directly binarizing latent features may lead to unbalanced encoding
schemes. Fundamentally, by continuous relaxation of the
binary constraints, most methods solve an optimization problem
that may deviate significantly from the hashing objective, because the
quantization error is not accounted for in the optimization process.
This somewhat contradicts the motivation of multimodal hashing.
Hence, how to learn isomorphic hash codes for multimodal data in
a seamless optimization framework remains an open problem.

In this paper, we propose Composite Correlation Quantization
(CCQ), a novel model towards seamless multimodal hashing.
Technically, CCQ jointly finds correlation-maximal mappings that
transform different modalities into an isomorphic latent space, and learns
composite quantizers that convert the isomorphic latent features
into compact binary codes. The flowcharts of CCQ and prior work
are shown in Figure 1. To create a seamless optimization
framework, we are inspired by Latent Semantic Analysis (LSA) and
decompose each datum into three latent factors, namely, a
correlation-maximal mapping, a similarity-preserving codebook, and a compact
binary code. The three latent factors are jointly learned through an
optimization problem that preserves both intra-modal similarity
and inter-modal correlation while minimizing both reconstruction
and quantization errors. The CCQ model can construct extremely
compressed and balanced binary codes to enable efficient
multimodal search, can readily handle the ubiquitous semi-paired scenario
where only a fraction of the input data are multimodal, and scales
linearly with the sample size. Comprehensive empirical evidence on
large-scale datasets confirms that the CCQ model exhibits superior
performance in both effectiveness and efficiency on both unimodal
and cross-modal search against state-of-the-art hashing methods.

Figure 1: Flowcharts of prior work (left) and CCQ (right). Prior work is a two-step pipeline: first map image-text pairs to an isomorphic
latent space (denoted as polygon) and then binarize the continuous representation to hash codes (denoted as vertices of hypercube)
by sign thresholding. CCQ is a seamless optimization framework: jointly map both paired/unpaired images and texts to an isomorphic
latent space (denoted as polygon) and learn hash codes by composite quantization. The quantization model learns an isomorphic
codebook (denoted as Voronoi diagram) and binary codes (denoted as histograms) by minimizing the quantization error, which suffices to
assign each latent representation to M nearest codewords (denoted as Voronoi cells); the assignment indices are used as hash codes.

The rest of the paper is organized as follows. We review related
work in Section 2. We formally present our model in Section 3
and the algorithm with analysis in Section 4. Empirical evaluations are
reported in Section 5, while conclusions are given in the final section.

2. RELATED WORK 

Hashing-based multimodal search has recently become a prevalent research
focus in the machine learning and information retrieval communities.
It enables approximate
similarity search on multimedia databases with significant speedup
and acceptable accuracy. Refer to [22] for a comprehensive survey.

Existing multimodal hashing methods can be organized into two
categories: supervised methods and unsupervised methods. CMSSH,
SCM, QCH, and SePH are supervised hashing
methods that require labeled pairs to indicate whether objects from
different modalities are similar (positive) or dissimilar (negative). As
supervised information is unavailable in many applications,
the deployment of these methods may be severely restricted. CVH,
IMH, MSAE, and CorrAE are unsupervised hashing
methods applicable to the most general multimodal retrieval
case given that paired data are available; our proposed CCQ
model falls into this category. IMH [20] is an extension of spectral
hashing [25] to multimodal data, and it is restricted by the training
burden of constructing and eigendecomposing the similarity
matrices. While CVH tackles the scalability
issue, it does not jointly maximize cross-modality correlation and
preserve intra-modality similarity. MSAE and CorrAE can
capture both intra-modal similarity and inter-modal correlation by
deep autoencoders, but they require spectral hashing or sign
thresholding to obtain binary codes from the continuous embeddings,
which gives rise to uncontrollable quantization errors.

A crucial problem with existing methods is that they essentially
work in a separated two-step pipeline: first embed multimodal data
into a common continuous latent space and then threshold the
continuous embeddings into binary codes of the Hamming space. Such
conversion from real-valued features to discrete codes may result
in substantial information loss, making the continuous latent space
suboptimal for the binary codes and the binary codes suboptimal
for retrieval. Furthermore, directly binarizing latent
representations may lead to unbalanced encoding schemes, as shown in
prior work. Although IMVH learns multimodal hash functions using
a graph-cut quantizer instead of sign thresholding, the quantizer
solves a fast approximation of the energy function with orthogonal
constraints and incurs large quantization error and unbalanced codes.
CCQ approaches this problem by learning the modality-consistent
latent space and balanced binary codes in a principled framework.

3. COMPOSITE CORRELATION QUANTIZATION

3.1 Problem Statements 

In the multimodal search system, the database and query consist
of objects from different modalities. We use only image and text as
two modalities to explain our approach, but the approach is
formulated to support any number $V$ of modalities. Let
$X^1 = [x_1^1, \ldots, x_{N_1}^1] \in \mathbb{R}^{P_1 \times N_1}$ be
an image set of $N_0$ images with tags and the remaining $\tilde{N}_1$ images without
tags, where $N_1 = N_0 + \tilde{N}_1$ and each image is represented by a
$P_1$-dimensional feature vector. Let
$X^2 = [x_1^2, \ldots, x_{N_2}^2] \in \mathbb{R}^{P_2 \times N_2}$ be a text set of $N_0$
documents of the image tags and $\tilde{N}_2$ additional documents, where
$N_2 = N_0 + \tilde{N}_2$ and each text is represented by a $P_2$-dimensional
feature vector. Note that the proposed approach can handle
semi-paired data where only a fraction $N_0/(N_1 + N_2)$ of objects are
multimodal, which is more realistic than the fully paired setting
assumed by typical multimodal methods.

An efficient approach to calculating the distance between image 
and text is to map images and texts to modality-isomorphic binary 
codes in which different modalities of the objects are comparable. 
In this paper, we will approach this problem by a joint optimization 
framework, dubbed Composite Correlation Quantization (CCQ). 

Definition 1 (CCQ). Given an image $x^1 \in \mathbb{R}^{P_1}$ and a text
$x^2 \in \mathbb{R}^{P_2}$, learn two correlation-maximal mappings
$f^1 : \mathbb{R}^{P_1} \mapsto \mathbb{R}^{D}$ and
$f^2 : \mathbb{R}^{P_2} \mapsto \mathbb{R}^{D}$ that transform images and texts into a
$D$-dimensional isomorphic latent space, and jointly learn two
composite quantizers $q^1 : \mathbb{R}^{D} \mapsto \{0,1\}^{H}$ and
$q^2 : \mathbb{R}^{D} \mapsto \{0,1\}^{H}$
that quantize latent embeddings into compact $H$-bit binary codes.

In the common $H$-bit binary space, images and texts are directly
comparable, so that both intra-modal and cross-modal search can
be readily supported. After the mappings $f^1$, $f^2$ and quantizers $q^1$, $q^2$
have been learned, the multimodal search problem can be converted
into a classical approximate nearest neighbor (ANN) search problem.

3.2 Composite Correlation Quantization 

The main idea of CCQ is to jointly learn a correlation-maximal
latent space and a similarity-preserving composite quantization in
a unified optimization framework. To this end, we are
inspired by Latent Semantic Analysis (LSA) and decompose
each input datum (image or text) $x_n^v$ into three latent factors $R^v$,
$C^v$, $b_n^v$, that is, $x_n^v \approx R^v C^v b_n^v$. While sharing a similar form
with LSA, our formulation endows these latent factors with different
semantics and thus constrains them with different conditions. More
specifically, $R^v$ is the correlation-maximal mapping, $C^v$ is the
similarity-preserving codebook, and $b_n^v$ is the compact binary code of $x_n^v$. We
present how to formulate the CCQ approach under these semantics.

3.2.1 Intra-Modality Similarity Quantization
To represent inputs with compact binary codes, two mainstream
paradigms are sign thresholding in Hamming embedding methods
[25], and vector quantization in codebook-based encoding methods.
As sign thresholding cannot guarantee minimal quantization
error, we adopt the vector quantization paradigm. CCQ is
based on a set of $M$ codebooks $C^v = [C_1^v, \ldots, C_M^v]$, where each
codebook $C_m^v$ contains $K$ codewords $C_m^v = [c_{m1}^v, \ldots, c_{mK}^v]$,
and each codeword $c_{mk}^v$ is a $D$-dimensional vector like a cluster
centroid in k-means clustering. Corresponding to the $M$ codebooks,
we partition the binary codeword assignment vector $b_n^v$ into $M$
1-of-$K$ indicator vectors $b_n^v = [b_{1n}^v; \ldots; b_{Mn}^v]$, and each indicator
vector $b_{mn}^v$ indicates which one (and only one) of the $K$ codewords
in the $m$th codebook is selected to approximate the $n$th data point.
The CCQ model encodes each $x_n^v$ as the sum of $M$ codewords, one
codeword per codebook, each indicated by the binary assignment
vector $b_n^v$. This yields a novel and more accurate composite
approximation scheme $x_n^v \approx R^v \sum_{m=1}^{M} C_m^v b_{mn}^v$. Consistent with
LSA and k-means, the sum of squared losses between all $x_n^v$'s and the
sum of selected codewords after transformation by $R^v$ is minimized:

\[
\begin{aligned}
\min_{R^v, C^v, B^v} \quad & \sum_{n=1}^{N_v} \Big\| x_n^v - R^v \sum_{m=1}^{M} C_m^v b_{mn}^v \Big\|_2^2 \\
\text{s.t.} \quad & \| b_{mn}^v \|_0 = 1, \; b_{mn}^v \in \{0,1\}^K, \\
& m = 1 \ldots M, \; n = 1 \ldots N_v,
\end{aligned}
\tag{1}
\]

where $\|\cdot\|_0$ denotes the $\ell_0$-norm that simply counts the number of
the vector's nonzero elements. The constraint guarantees that only
one codeword in each codebook can be activated to approximate
the input data, hence it leads to compact binary codes. As the
binary constraints are directly imposed on the learning objective and
are valid throughout the optimization procedure, the derived binary
codes are much more accurate than sign thresholding binary codes.
The rationale of using $M$ codebooks instead of a single codebook to
approximate each input datum is to further minimize quantization
error, as the latter is shown to yield significantly lossy compression
and incur an evident performance drop. Quantization based on
multiple codebooks yields balanced composite binary codes which
are more effective than Hamming embedding binary codes.

3.2.2 Inter-Modality Correlation Maximization
The most desirable value of multimodal retrieval is to enable
transfer of knowledge across different modalities so that cross-modal
retrieval performance can be improved. A fundamental assumption
for multimodal retrieval is that by mapping objects into a
modality-consistent latent space, the latent space representations of
semantically relevant inter-modal pairs should be consistent. More
specifically, for each input object with both image modality $x_n^1$ and text
modality $x_n^2$, after being transformed by $R^1$ and $R^2$ in Equation (1),
the latent space representations for the image modality $C^1 b_n^1$ and the text
modality $C^2 b_n^2$ should be similar. To our knowledge, most prior
work adopts the coupling strategy to minimize $\| C^1 b_n^1 - C^2 b_n^2 \|_2^2$.

In this paper, we propose to maximize cross-modal correlation by
sharing codebooks $\{C_m\}_{m=1}^{M}$ across different modalities and sharing
binary codes $\{b_n\}_{n=1}^{N_0}$ for semantically relevant inter-modal pairs.
For the data points with only one modality, the multimodal
sharing strategy does not apply. Hence, the proposed condition that
the modality-consistent latent space should satisfy is formulated as

\[
C_m^1 = C_m^2 = C_m
\quad \text{and} \quad
\delta(b_{mn}^v) =
\begin{cases}
b_{mn}, & n = 1 \ldots N_0 \\
b_{mn}^v, & \text{otherwise,}
\end{cases}
\tag{2}
\]

where $\delta(\cdot)$ distinguishes multimodal objects from unimodal ones.
Different from most prior methods, our modality-consistent
condition requires identical codes $b_n^1 = b_n^2$, instead of a minimized
distance $\| b_n^1 - b_n^2 \|$, for the semantically relevant inter-modal pairs.
There are two advantages of our approach. First, since our learning
objective keeps the binary constraint valid throughout the optimization
procedure, it is very difficult to require a minimized distance between
two binary codes, as their nonzero elements may differ significantly.
Note that prior methods simply drop the binary condition and solve
a continuous problem, which leads to uncontrollable quantization
error with the post-step sign thresholding. Second, integrating the
minimized distance condition into the learning objective, as existing
methods do, introduces a trade-off term, or parameter, that is hard
to tune since its magnitude is very different from the learning loss (1).
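
As a rough illustration of the composite approximation in Section 3.2.1 and the code sharing in Section 3.2.2, the following NumPy sketch decodes a latent vector as the sum of one codeword per codebook and checks that this equals the matrix form $C b_n$ with concatenated 1-of-$K$ indicators. The array layout (codebooks stored as an `(M, D, K)` array, codes stored as integer indices) and the function names `one_hot` and `decode` are assumptions made for this example, not part of the paper.

```python
import numpy as np

def one_hot(codes, K):
    """Expand integer codes (N, M) into concatenated 1-of-K indicators (N, M*K)."""
    N, M = codes.shape
    B = np.zeros((N, M * K))
    # Row n gets a single 1 inside each of its M blocks of length K.
    B[np.arange(N)[:, None], np.arange(M) * K + codes] = 1.0
    return B

def decode(codebooks, codes):
    """Sum the selected codeword of each codebook.

    codebooks: (M, D, K) array, one D x K codebook per row (assumed layout).
    codes:     (N, M) integer array with entries in [0, K).
    Returns the (N, D) latent approximations sum_m C_m b_mn.
    """
    M = codebooks.shape[0]
    return sum(codebooks[m][:, codes[:, m]].T for m in range(M))

rng = np.random.default_rng(0)
M, K, D, N = 4, 256, 16, 5
codebooks = rng.normal(size=(M, D, K))    # shared by all modalities
codes = rng.integers(0, K, size=(N, M))   # shared by both sides of a paired object
latent = decode(codebooks, codes)         # same latent vector for image and text
```

Under this layout a paired image and text reconstruct as $R^1$ and $R^2$ applied to the same row of `latent`, which is what identical codes and shared codebooks imply.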


3.2.3 Joint Optimization Framework 
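To approach CCQ, which jointly learns a similarity-preserving
composite quantization and a correlation-maximal latent space in a
unified optimization framework, we require the codebooks
$\{C_m\}_{m=1}^{M}$ to minimize the quantization error of all
modalities as in Equation (1), and the mappings $R^v$ to maximize
the correlations between semantically relevant inter-modal pairs as in
Equation (2), while jointly minimizing the reconstruction error of the
input data as in LSA. This leads to the joint optimization framework

\[
\begin{aligned}
\min_{R^v, C, B^v} \quad & \sum_{v=1}^{V} \lambda_v \sum_{n=1}^{N_v} \Big\| x_n^v - R^v \sum_{m=1}^{M} C_m \delta(b_{mn}^v) \Big\|_2^2 \\
\text{s.t.} \quad & R^{v\top} R^v = I_{D \times D}, \; R^v \in \mathbb{R}^{P_v \times D}, \\
& \| \delta(b_{mn}^v) \|_0 = 1, \; \delta(b_{mn}^v) \in \{0,1\}^K, \\
& \delta(b_{mn}^v) =
\begin{cases}
b_{mn}, & n = 1 \ldots N_0 \\
b_{mn}^v, & \text{otherwise}
\end{cases} \\
& v = 1 \ldots V, \; m = 1 \ldots M, \; n = 1 \ldots N_v,
\end{aligned}
\tag{3}
\]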


where $\lambda_v$ is the weight parameter for each modality. In bimodal
problems with $V = 2$, we simplify the notation by denoting
$\lambda_1 = 1$ and $\lambda_2 = \lambda$, and this notation is used throughout
this paper. $R^v$ is the transformation matrix that maps the inputs of
each modality to a $D$-dimensional modality-consistent latent space.
The orthogonal constraints are motivated by LSA, and they turn the
latent factors $R^v$ into transformation matrices for efficient
out-of-sample quantization. The binary codes $b_n^v$ are $MK$-dimensional;
fortunately, however, each $b_{mn}^v$ is a 1-of-$K$ encoding with only one
nonzero element and can be represented using $\log_2 K$ bits, hence
the final hash codes can be compacted into $H = M \log_2 K$ bits,
which is independent of the dimensions of the input or latent spaces.
To fit each $b_{mn}^v$ into one byte, $K = 256$ is a good choice.
We simply set $D = \min(\{P_v\}_{v=1}^{V}, H)$, in the sense that an $H$-bit
binary code can reconstruct a $D$-dimensional vector accurately.
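
The code-length arithmetic above can be checked with a few lines of Python. The helper name `code_length_bits` is hypothetical, and the sketch assumes each codeword index is stored with a whole number of bits.

```python
import math

def code_length_bits(M, K):
    """Bits needed to store M codeword indices, each drawn from K codewords."""
    return M * math.ceil(math.log2(K))

# With K = 256, every index fits in exactly one byte, so M codebooks give M bytes.
for M in (2, 4, 8):
    print(M, "codebooks ->", code_length_bits(M, 256), "bits")
```

For instance, 16-bit and 32-bit codes with $K = 256$ correspond to $M = 2$ and $M = 4$ codebooks respectively.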

Formally, we derive correlation-maximal mappings $f^v(x_n^v) =
R^{v\top} x_n^v$ and similarity-preserving quantizers $q^v(f^v(x_n^v)) = \delta(b_n^v)$.
There are several advantages of the CCQ approach. First, CCQ
jointly learns a correlation-maximal latent space and a
similarity-preserving composite encoding, which can minimize the
quantization loss and guarantee search quality. Second, CCQ explores both
paired and unpaired data in a semi-paired quantization paradigm,
which can benefit from semi-supervised learning in that paired data
consolidate inter-modality correlation and unpaired data enhance
intra-modality quantization. Third, CCQ is formulated with only
two easy-to-tune model parameters $D$ and $\lambda$, where $D$ can be set as
simply as in LSA to retain most covariance information, and $\lambda$ can be
selected by trading off different modalities using prior information.
In particular, the proposed sharing of codebooks and binary codes
across modalities enables joint learning of latent semantics that
are maximally correlated in the isomorphic feature space, which
contributes most significantly to the efficacy of the CCQ approach.

3.3 Approximate Nearest Neighbor Search 

Approximate nearest neighbor (ANN) search based on Euclidean
distance is an important application of quantization techniques. Given
a database of CCQ hash codes, we follow [12, 17] and
use Asymmetric Quantizer Distance (AQD) as the similarity metric, which
computes the distance between a query $q^v$ and a database point $x_n^v$ as

\[
\begin{aligned}
\mathrm{AQD}(q^v, x_n^v)
&= \Big\| q^v - R^v \sum_{m=1}^{M} C_m \delta(b_{mn}^v) \Big\|_2^2 \\
&= -2 \sum_{m=1}^{M} \big\langle \tilde{q}^v, C_m \delta(b_{mn}^v) \big\rangle
 + \Big\| \sum_{m=1}^{M} C_m \delta(b_{mn}^v) \Big\|_2^2 \\
&\quad + \| \tilde{q}^v \|_2^2 + \| R_\perp^{v\top} q^v \|_2^2,
\end{aligned}
\tag{4}
\]

where $\tilde{q}^v = R^{v\top} q^v$ is the transformed query and $R_\perp^v$ is an
orthogonal complement of $R^v$. In the second row,
the first term computes the inner products between $\tilde{q}^v$ and the $M$
codewords selected by $\delta(b_n^v)$. Given a query, these inner products for all
$M$ codebooks $\{C_m\}_{m=1}^{M}$ and all $K$ possible values of $b_{mn}^v$ can be
pre-computed and stored in a query-specific $M \times K$ lookup table,
which is used to compute AQD between the query and all database
points; each computation entails $M$ table lookups and additions and is
only slightly more costly than Hamming distance. The second term computes
the squared norm of the decoded database point, which is independent
of the query and can be encoded using one byte by quantizing these
scalar values on a held-out dataset. The last two terms depend only on
the query and do not affect the ranking. At quantization, we augment the
CCQ code with the norm byte, which costs one more lookup and
one more byte per database point. We can eliminate this norm byte
by composite quantization [30], but leave it to future work.
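
The lookup-table evaluation of AQD described above can be sketched as follows. This is a minimal illustration under assumptions that are not in the paper: codebooks are stored as an `(M, D, K)` array, codes as integer indices, and the squared norms of the decoded points are kept as floats rather than quantized to one byte. The function name `aqd_scores` is hypothetical. The returned scores omit the two query-only terms of Equation (4), so they rank database points identically to the full distance.

```python
import numpy as np

def aqd_scores(q_latent, codebooks, codes, db_sq_norms):
    """Ranking-equivalent AQD of one query against N database codes.

    q_latent:    (D,) transformed query R^T q.
    codebooks:   (M, D, K) array of shared codebooks (assumed layout).
    codes:       (N, M) integer codeword indices of the database points.
    db_sq_norms: (N,) squared norms of the decoded database points.
    """
    M = codebooks.shape[0]
    # table[m, k] = inner product between the query and codeword k of codebook m
    table = np.einsum("d,mdk->mk", q_latent, codebooks)
    # M table lookups and additions per database point
    inner = table[np.arange(M), codes].sum(axis=1)
    return -2.0 * inner + db_sq_norms
```

Adding $\|\tilde{q}^v\|_2^2$ to these scores recovers the squared distance between the transformed query and each decoded database point.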

4. ALGORITHM AND ANALYSIS 

4.1 Learning Algorithm 

The CCQ optimization problem consists of three variables,
$R^v$, $C$, and $B^v$. We adopt alternating optimization,
which iteratively updates one variable with the remaining variables fixed.

4.1.1 Update $R^v$

We update $R^v$ by fixing $C$ and $B^v$ as known variables, and write
Equation (3) with $R^v$ as the unknown variable in matrix form,

\[
\min_{R^v} \; \big\| X^v - R^v C \delta(B^v) \big\|_F^2
\quad \text{s.t.} \quad R^{v\top} R^v = I_{D \times D}.
\tag{5}
\]

This is equivalent to the Orthogonal Procrustes problem and
can be solved exactly using SVD. More specifically, we perform the
SVD $X^v [C \delta(B^v)]^\top = U S V^\top$, and then obtain $R^v = U V^\top$.
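
The Procrustes step can be written in a few lines of NumPy. In this sketch `X` stands for the data matrix of one modality and `Y` for the decoded latent matrix $C\delta(B^v)$; the function name `update_mapping` is an assumption for illustration, and the sketch assumes $P_v \ge D$ so that the thin SVD yields a matrix with orthonormal columns.

```python
import numpy as np

def update_mapping(X, Y):
    """Solve min_R ||X - R Y||_F subject to R^T R = I.

    X: (P, N) data matrix of one modality.
    Y: (D, N) decoded latent matrix, i.e. the codebook times the indicators.
    Returns R of shape (P, D) with orthonormal columns.
    """
    # Thin SVD of the P x D cross matrix X Y^T
    U, _, Vt = np.linalg.svd(X @ Y.T, full_matrices=False)
    return U @ Vt
```

Only the $P_v \times D$ cross matrix enters the SVD, so its cost does not depend on the number of data points once that matrix has been formed.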

4.1.2 Update $C$

We update $C$ by fixing $R^v$ and $B^v$ as known variables, and write
Equation (3) with $C$ as the unknown variable in matrix form,

\[
\min_{C} \; \sum_{v=1}^{V} \lambda_v \big\| R^{v\top} X^v - C \delta(B^v) \big\|_F^2.
\tag{6}
\]

This is an unconstrained quadratic problem with an analytic solution.
Algorithms such as L-BFGS can be used to speed up computation.

Algorithm 1: CCQ: Composite Correlation Quantization

Input: Data $\{X^v\}$, latent dimension $D$, modal weight $\lambda$.
Output: Mappings $\{R^v\}$, codebook $C$, binary codes $\{B^v\}$.
Initialize $\{R^v\}$ by identity, $C$ randomly, $\{B^v\}$ by NN search.
repeat
    Update $\{R^v\}$ by Orthogonal Procrustes as Eqn. (5).
    Update $C$ by Quadratic Optimization as Eqn. (6).
    for $n \leftarrow 1$ to $N_v$ do
        Update $\{b_n^v\}$ by ICM or greedy algorithm as Eqn. (7).
    end
until Convergence
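
One way to realize the codebook update of Equation (6) is as a weighted linear least-squares problem. The sketch below is an illustration under assumptions, not the authors' implementation: it stacks the modalities after scaling by $\sqrt{\lambda_v}$ and calls `numpy.linalg.lstsq`, which returns a minimum-norm solution when the indicator matrix is rank-deficient. The name `update_codebook` and the argument layout are hypothetical.

```python
import numpy as np

def update_codebook(latents, indicators, weights):
    """Solve min_C sum_v w_v ||Z_v - C B_v||_F^2 by least squares.

    latents:    list of (D, N_v) matrices Z_v = R_v^T X_v.
    indicators: list of (M*K, N_v) 0/1 indicator matrices B_v.
    weights:    list of positive modality weights w_v.
    Returns the concatenated codebook C of shape (D, M*K).
    """
    # Scaling both sides by sqrt(w_v) turns the weighted sum into one Frobenius norm.
    Z = np.hstack([np.sqrt(w) * z for z, w in zip(latents, weights)])
    B = np.hstack([np.sqrt(w) * b for b, w in zip(indicators, weights)])
    # Solve B^T C^T ~= Z^T for C^T.
    Ct, *_ = np.linalg.lstsq(B.T, Z.T, rcond=None)
    return Ct.T
```

A least-squares solver is used here instead of an explicit inverse because, with $M > 1$ codebooks, the rows of each indicator matrix are linearly dependent (every block of $K$ rows sums to the all-ones row).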


4.1.3 Update $B^v$

Each $b_n^v$ is independent of $\{b_{n'}^v\}_{n' \neq n}$, so the
optimization problem for $B^v$ decomposes into $N_v$ subproblems,

\[
\begin{aligned}
\min_{b_n^v} \quad & \sum_{v=1}^{V} \lambda_v \Big\| R^{v\top} x_n^v - \sum_{m=1}^{M} C_m \delta(b_{mn}^v) \Big\|_2^2 \\
\text{s.t.} \quad & \| \delta(b_{mn}^v) \|_0 = 1, \; \delta(b_{mn}^v) \in \{0,1\}^K.
\end{aligned}
\tag{7}
\]

This optimization problem is generally NP-hard. As shown in [30],
this problem is essentially a high-order Markov Random Field (MRF)
problem and can be solved by the Iterated Conditional Modes (ICM)
algorithm, which solves the $M$ indicators $\{b_{mn}^v\}_{m=1}^{M}$ alternately.
Given $\{b_{m'n}^v\}_{m' \neq m}$ fixed, we update $b_{mn}^v$ by exhaustively
checking all the codewords in codebook $C_m$, finding the codeword such
that the objective in (7) is minimized, and setting the corresponding
entry of $b_{mn}^v$ to 1 and the rest to 0. The algorithm is guaranteed to
converge, and can be terminated when a maximum number of iterations is reached.
To accelerate quantization, we can exploit the hierarchical structure of
the codebooks $\{C_m\}$ and update $\{b_{mn}^v\}$ by a greedy algorithm.
Specifically, after updating $\{b_{m'n}^v\}_{m' < m}$, we can update $b_{mn}^v$ by
encoding the residual $R^{v\top} x_n^v - \sum_{m' < m} C_{m'} \delta(b_{m'n}^v)$ with codebook
$C_m$. The overall learning procedure is summarized in Algorithm 1.
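
The two encoding strategies above can be sketched for a single latent vector as follows. The function names, the `(M, D, K)` codebook layout, and the default of three ICM sweeps are assumptions for illustration. For a paired object, minimizing $\sum_v \lambda_v \| z^v - c \|_2^2$ over a decoded vector $c$ is equivalent to encoding the $\lambda$-weighted mean of the latent vectors $z^v = R^{v\top} x_n^v$, so that mean can be passed as `z`.

```python
import numpy as np

def greedy_encode(z, codebooks):
    """Greedy residual encoding: pick the nearest codeword, subtract it, move on.

    z:         (D,) latent vector (float array).
    codebooks: (M, D, K) array of codebooks (assumed layout).
    Returns an (M,) integer array of codeword indices.
    """
    residual = z.copy()
    codes = np.empty(codebooks.shape[0], dtype=np.int64)
    for m, Cm in enumerate(codebooks):
        dists = ((residual[:, None] - Cm) ** 2).sum(axis=0)
        codes[m] = int(np.argmin(dists))
        residual -= Cm[:, codes[m]]
    return codes

def icm_encode(z, codebooks, codes, n_iters=3):
    """Iterated Conditional Modes: re-pick one indicator with the others fixed."""
    codes = codes.copy()
    total = sum(Cm[:, k] for Cm, k in zip(codebooks, codes))
    for _ in range(n_iters):
        for m, Cm in enumerate(codebooks):
            rest = total - Cm[:, codes[m]]          # contribution of the other codebooks
            dists = (((z - rest)[:, None] - Cm) ** 2).sum(axis=0)
            codes[m] = int(np.argmin(dists))        # exhaustive check of K codewords
            total = rest + Cm[:, codes[m]]
    return codes
```

Because each ICM step keeps the current codeword among its candidates, the quantization error cannot increase, so initializing ICM with the greedy codes is a reasonable default in this sketch.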


4.2 Large-Scale Implementation 

Batch algorithms are memory-inefficient for large-scale datasets,
hence we formulate the CCQ optimization as mini-batch algorithms
for large-scale problems [24]. The main idea is to split the training
set into mini-batches and load a fraction of the data points into memory
each time. Hence, the memory usage stays constant when the size
of the training set increases. The update of $B^v$ in Equation (7) is
already mini-batch in that the update of each data point is independent
of the other data points. To update $R^v$ in mini-batch, we notice that
the matrix for SVD is $X^v [C \delta(B^v)]^\top \in \mathbb{R}^{P_v \times D}$; once it is given,
the SVD can be computed at a cost that depends only on $P_v$ and $D$,
independent of the number of
data points. We thus formulate the matrix for SVD in a point-wise
summation form as $\sum_{n=1}^{N_v} x_n^v [C \delta(b_n^v)]^\top$, so that it can be computed
by traversing all data points in a mini-batch paradigm. Similarly,
the update of $C$ can also be formulated in a summation form for
mini-batch implementation. Note that we can allocate all available
memory to the mini-batch and trade off memory and disk reading costs.
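
The point-wise summation form makes the accumulation straightforward. The sketch below assumes the batches are supplied by some iterable of `(X_batch, B_batch)` pairs; how they are read from disk is left out, and the function name is hypothetical.

```python
import numpy as np

def accumulate_procrustes_matrix(batches, C):
    """Accumulate sum_n x_n [C b_n]^T over mini-batches.

    batches: iterable of (X_batch, B_batch) with shapes (P, n) and (M*K, n).
    C:       (D, M*K) concatenated codebook.
    Returns the (P, D) matrix whose SVD gives the Procrustes update.
    """
    S = None
    for X_batch, B_batch in batches:
        contrib = X_batch @ (C @ B_batch).T   # (P, D) contribution of this batch
        S = contrib if S is None else S + contrib
    return S
```

Only one batch and the $P_v \times D$ accumulator are held in memory at a time, which is the constant-memory behavior described above.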


4.3 Computational Complexity 

We analyze the cost of each iteration to show that CCQ scales linearly
with the sample size $N_v$. To update $R^v$, it takes $O(N_v P_v D + N_v D M)$
to prepare the problem, plus the cost of the SVD of a $P_v \times D$ matrix,
which does not depend on $N_v$.
To update $C$, it takes $O(N_v P_v D + N_v D M + N_v M^2)$ to prepare
the problem, plus the cost of solving the quadratic
optimization, which likewise does not depend on $N_v$.
To update $B^v$, it takes $O(N_v P_v D + N_v D M K T_i)$,
where $T_i$ is the number of iterations; $T_i = 3$ in the ICM algorithm
or $T_i = 1$ in the greedy algorithm obtains satisfactory performance.
As a rule of thumb, $D = H$ and $K = 256$ are good choices for
most applications. For longer codes, the update of $C$ is inefficient, in
which case we can adopt the online L-BFGS algorithm for speedup.

4.4 Approximation Error Analysis 

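Given a query $q^v$ and a database point $x_n^v$, after being transformed by the
correlation-maximal mappings as $\tilde{q}^v = R^{v\top} q^v$ and $\tilde{x}_n^v = R^{v\top} x_n^v$,
they are comparable in the modality-consistent latent space, and
their Euclidean distance is computed as $d(\tilde{q}^v, \tilde{x}_n^v) = \| \tilde{q}^v - \tilde{x}_n^v \|_2$.
As computing Euclidean distance on real-valued vectors is too costly
for large-scale search, we compute AQD (4) on binary codes. Hence,
we need to analyze the error bound of using AQD to approximate the
real-valued distance. Denote by $\bar{x}_n^v = \sum_{m=1}^{M} C_m \delta(b_{mn}^v)$ the decoded
vector of $x_n^v$; then $\mathrm{AQD}(q^v, x_n^v) = d^2(\tilde{q}^v, \bar{x}_n^v) + \epsilon$, where
$\epsilon = \| R_\perp^{v\top} q^v \|_2^2$ is a constant for a given query.

Theorem 1 (Bound). The error is bounded by the learning loss:

\[
\big| d(\tilde{q}^v, \tilde{x}_n^v) - d(\tilde{q}^v, \bar{x}_n^v) \big|
\le \Big\| x_n^v - R^v \sum_{m=1}^{M} C_m \delta(b_{mn}^v) \Big\|_2.
\tag{8}
\]

Proof. From the triangle inequality,
$| d(\tilde{q}^v, \tilde{x}_n^v) - d(\tilde{q}^v, \bar{x}_n^v) | \le d(\tilde{x}_n^v, \bar{x}_n^v)$.
Then

\[
\begin{aligned}
d^2(\tilde{x}_n^v, \bar{x}_n^v)
&= \Big\| R^{v\top} x_n^v - \sum_{m=1}^{M} C_m \delta(b_{mn}^v) \Big\|_2^2 \\
&\le \Big\| R^{v\top} x_n^v - \sum_{m=1}^{M} C_m \delta(b_{mn}^v) \Big\|_2^2
 + \big\| R_\perp^{v\top} x_n^v \big\|_2^2 \\
&= \Big\| x_n^v - R^v \sum_{m=1}^{M} C_m \delta(b_{mn}^v) \Big\|_2^2,
\end{aligned}
\tag{9}
\]

where $R_\perp^v$ is an orthogonal complement of $R^v$, $R_\perp^{v\top} R^v = 0$. □

The theorem confirms that the error of using AQD to approximate the
real-valued distance is statistically bounded by the CCQ learning loss.
Hence, CCQ is more accurate than sign thresholding methods [25].
An important advantage of CCQ in Equation (3) is that the mapping
$R^v$ is learned by a joint optimization of canonical correlation
analysis (CCA) and principal component analysis (PCA), corresponding
to the first and second terms of Line 2 in Equation (9). This can be
much more effective than most CCA-based methods.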


5. EXPERIMENTS 

We conduct extensive evaluation of CCQ against state of the art 
methods on three public multimodal datasets. We investigate both 
effectiveness and efficiency in terms of search precision, recall, and 
time. The codes, data, and configurations will be available online. 

5.1 Datasets 

The evaluation is conducted on three datasets: NUS-WIDE,
Wiki, and Flickr1M, with statistics depicted in Table 1.
We preprocess all datasets by applying ZCA to normalize each
dimension of the image/text features to zero mean and unit variance.

Table 1: The Statistics of Three Datasets

Dataset         NUS-WIDE    Wiki     Flickr1M
Complete Set     195,834   2,866    1,000,000
Labeled Set      195,834   2,866       25,000
Query Set          2,000     693        1,000
Database         193,834   2,173       24,000
Training Set      10,000   2,173      975,000

NUS-WIDE (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm) is a Web
image dataset containing 269,648 images
downloaded from Flickr, each associated with 6 tags on average.
There are 81 ground truth concepts manually annotated for search
evaluation. Following prior works, we prune the original
NUS-WIDE to form a new dataset consisting of 195,834 image-text
pairs by keeping the pairs that belong to one of the 21 most frequent
concepts. The images are represented by 500-dimensional
bag-of-words vectors extracted from the SIFT features using k-means, and
the texts are represented by 1,000-dimensional vectors extracted
from the tag occurrence features using PCA. A query set of 2,000
image-text pairs is randomly sampled from the dataset, while the
remaining 193,834 image-text pairs serve as the database.
The hash models are learned on a training set containing 10,000
image-text pairs randomly sampled from the database.

Wiki contains 2,866 image-text pairs selected from Wikipedia's
featured articles, which comprise multiple sections of images and texts.
Every image-text pair is labeled by one of the 10 concepts in the
article categories. Each image is represented by a 128-dimensional
bag-of-words vector extracted from SIFT features, and each text is
represented by the probability distribution over 10 topics learned by
a latent Dirichlet allocation (LDA) model. The dataset is released
with a query set of 693 pairs and a database of 2,173 pairs, and the
whole database is used as the training set for hash coding.

Flickr1M comprises 1,000,000 images associated with tags from
Flickr, of which 25,000 are labeled with 38 concepts while the
remaining 975,000 are unlabeled. The publicly available preprocessed
dataset is employed for evaluation, in which each image is
represented by a 3,857-dimensional vector concatenated from local SIFT
features, global GIST features, etc. Each text is represented by
a 2,000-dimensional vector extracted from tag occurrences. The
query set contains 1,000 image-text pairs randomly sampled from
the 25,000 labeled pairs, and the remaining 24,000 labeled pairs are used
as the database. In the scalability test of CCQ (Section 5), all 975,000
unlabeled pairs are used as the training set for learning hash codes.

5.2 Comparison Methods 

We compare CCQ against many state of the art hashing methods. 

• Unsupervised hashing: Cross-View Hashing (CVH)
and Inter-Media Hashing (IMH) [20] are unsupervised
hashing methods that extend spectral hashing to exploit the local
structure of multimodal data for learning binary codes.

• Deep hashing: Correspondence Auto-Encoders (CorrAE)
learns latent features via unsupervised deep auto-encoders,
which capture both intra-modal and inter-modal
correspondences, and binarizes latent features via sign thresholding.

• Supervised hashing: Cross-Modal Similarity-Sensitive
Hashing (CMSSH), Semantic Correlation Maximization (SCM),
and Quantized Correlation Hashing (QCH) are
supervised hashing methods which embed multimodal data into a
common Hamming space using supervised metric learning.

http://www.svcl.ucsd.edu/projects/crossmodal
http://www.cs.toronto.edu/~nitish/multimodal
http://staff.itee.uq.edu.au/shenht/U 0_IMH
https://github.com/fangxiangfeng/deepnet
http://www.cse.ust.hk/~dyyeung/code/mlbe.zip

Figure 2: Precision-recall curves (top) and precision@R curves (bottom) on NUS-WIDE cross-modal search tasks @ 16 and 32 bits.


5.3 Evaluation Protocols 

We perform four types of multimodal retrieval schemes: (1) I → I:
use image queries to retrieve relevant images; (2) T → T: use text
queries to retrieve relevant texts; (3) I → T: use image queries
to retrieve relevant texts; and (4) T → I: use text queries to
retrieve relevant images. The first two tasks are intra-modal retrieval
and the last two tasks are cross-modal retrieval. As CCQ can also
handle multimodal search where both modalities are available for
the database, we show the results of multimodal retrieval schemes
where each image-text pair is quantized into a unified hash code by
fusing knowledge of different modalities: (5) I → IT: use image
queries to retrieve relevant image-text pairs; (6) T → IT: use text
queries to retrieve relevant image-text pairs. The baseline methods
do not support multimodal search because they do not use shared
coding for different modalities of the same object. Given a query,
the ground truth is defined as follows: a result is relevant if it shares
at least one common concept with the query, and irrelevant otherwise.
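
The relevance rule above amounts to a set-intersection test. The following is a minimal sketch, assuming each query and database item carries a set of concept labels; the function name `is_relevant` and the sample labels are hypothetical and not taken from the evaluation datasets.

```python
from typing import Iterable


def is_relevant(query_concepts: Iterable[str], result_concepts: Iterable[str]) -> bool:
    """A result is relevant if it shares at least one concept with the query."""
    return not set(query_concepts).isdisjoint(result_concepts)


# Hypothetical concept annotations for one query and two retrieved results.
query = {"sky", "beach"}
print(is_relevant(query, {"beach", "person"}))  # the two sets share "beach"
print(is_relevant(query, {"car", "road"}))      # no concept in common
```

The same test applies to all six retrieval tasks, since relevance is defined on concept labels rather than on the modality of the query or the result.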

We adopt Mean Average Precision (MAP) to measure the effectiveness
of multimodal search. Given a set of
queries, we first calculate the Average Precision (AP) of each query as

    AP@R = \frac{\sum_{r=1}^{R} P(r)\,\delta(r)}{\sum_{r=1}^{R} \delta(r)},    (10)

where R is the number of retrieved documents, P(r) denotes the
precision of the top r retrieved results, and δ(r) = 1 if the r-th
retrieved result is a true neighbor of the query, otherwise δ(r) = 0.
MAP is then computed as the mean of the average precision over all
queries; the larger the MAP, the better the retrieval performance.
In the experiments, we follow prior work and report MAP@R with R =
50. We also report two other standard retrieval criteria,
precision-recall curves and precision@top-R curves, for all retrieval tasks. In
addition to effectiveness, we report time and memory costs as the
efficiency measures for query processing and model training.
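
As an illustration of the metric, the sketch below computes AP@R and MAP from ranked lists of relevance flags. It assumes the usual normalization by the number of relevant results within the top R, and it assumes that a query with no relevant result in the top R scores 0; the function names and the toy rankings are hypothetical.

```python
from typing import Sequence


def average_precision_at_r(relevant: Sequence[bool], r_max: int) -> float:
    """AP@R for one query, given the relevance flags of its ranked results."""
    hits = 0
    weighted_precision = 0.0
    for rank, is_hit in enumerate(relevant[:r_max], start=1):
        if is_hit:
            hits += 1
            weighted_precision += hits / rank  # P(r) * delta(r)
    # Assumed convention: no relevant result within the top R gives AP = 0.
    return weighted_precision / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[bool]], r_max: int = 50) -> float:
    """MAP: the mean of AP@R over all queries."""
    return sum(average_precision_at_r(r, r_max) for r in rankings) / len(rankings)


# Two toy queries with four ranked results each.
queries = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False],  # AP = (1/2) / 1
]
print(mean_average_precision(queries, r_max=4))
```

The default `r_max=50` mirrors the R = 50 setting reported in the experiments.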

The CCQ approach involves two model parameters: the dimension
D of the modality-consistent subspace and the modality trade-off weight
λ. In principle, CCQ is almost immune to different choices of D, as
long as D is large enough to retain the majority of the covariance
information, as in LSA. When no prior knowledge is available,
we can simply set equal weights λ = 1 for different modalities,
which already achieves satisfactory performance. Nonetheless,
for image-text bimodal search, the text modality usually carries more
semantic information, hence we equip CCQ with the flexibility of
selecting the optimal λ to encode such important prior knowledge.
Given annotation ground truths as in the evaluation datasets, we can
automatically select D and λ using cross-validation. However, we
choose to blindly fix λ = 5 throughout the comparative study. This
is desirable as cross-validation may be impossible in the pervasive
unsupervised multimodal search. We will study parameter sensitivity
in Section 5.8 to validate that CCQ can consistently outperform
the state of the art with a wide range of parameter configurations.

For the comparison methods, we adopt cross-validation to select
their respective optimal parameters. As cross-validation requires
annotation ground truths, this further confirms CCQ’s superior
parameter stability. Due to the computation burden, it is too costly to
train CMSSH and IMH on the complete Flickr1M dataset, hence
we randomly sample 10,000 image-text pairs to train these models.
Each experiment is repeated for ten runs and the average result is reported.

5.4 Experimental Results 

We compare CCQ with state of the art methods in terms of MAP
and precision-recall on four multimodal retrieval tasks (I → I, T → T,
I → T, T → I) of three datasets (NUS-WIDE, Wiki, and Flickr1M).

5.4.1 Results on NUS-WIDE 

We evaluate CCQ against the state of the art with different lengths
of hash codes, i.e. 8, 16, 32, and 64 bits, and report the MAP results
in Table 2. For all multimodal retrieval tasks, CCQ achieves significantly
better performance than all unsupervised hashing methods
CVH, IMH, and CorrAE, and outperforms the state of
the art supervised hashing methods CMSSH, SCM, and QCH in most
cases. It is worth noting that CCQ is an unsupervised hashing
method that does not require labeled similarity information. Hence
CCQ is particularly beneficial when labeled information is unavailable,
which is the most common scenario in the big data era. A notable
limitation of the orthogonality-constrained methods CVH and IMH is
that longer codes do not necessarily improve performance in the cross-modal
tasks I → T and T → I. The reason is that these methods
learn uncorrelated hash bits via eigenvalue decomposition on the similarity
matrix, which leads to unbalanced hash codes with the first
k eigenvectors (hash bits) dominating the whole hash codes. CCQ

Table 2: Mean Average Precision (MAP) Comparison of Six Multimodal Retrieval Tasks on Three Standard Datasets

                    NUS-WIDE                        Wiki                            Flickr1M
Task    Method      8 bits  16 bits 32 bits 64 bits 8 bits  16 bits 32 bits 64 bits 8 bits  16 bits 32 bits 64 bits
I → I   CVH         0.3954  0.4542  0.4…    0.4780  0.1988  0.1969  0.2…    0.2058  0.6050  0.6328  0.6…    0.6712
        IMH         0.4313  0.4545  0.4155  0.4005  0.1910  0.1963  0.1937  0.1935  0.5239  0.5725  0.5736  0.5748
        CorrAE      0.4223  0.4478  0.4587  0.4796  0.2055  0.2086  0.2188  0.2194  0.6145  0.6397  0.6588  0.6654
        CMSSH       0.3776  0.4060  0.4356  0.4490  0.1987  0.1979  0.2007  0.2126  0.5738  0.6304  0.6587  0.6932
        SCM         0.4258  0.4578  0.4695  0.4831  0.2048  0.2103  0.2177  0.2212  0.5926  0.6257  0.6615  0.6801
        QCH         0.4289  0.4557  0.4786  0.4898  0.2087  0.2155  0.2198  0.2252  0.6165  0.6586  0.6787  0.6885
        CCQ (ours)  0.4711  0.4859  0.4921  0.4932  0.2226  0.2265  0.2373  0.2386  0.6714  0.7092  0.7318  0.7451
T → T   CVH         0.5825  0.6485  0.6…    0.7189  0.4049  0.5506  …       0.6239  0.5812  0.6085  0.6…    0.6337
        IMH         0.4531  0.4740  0.5421  0.6202  0.3805  0.4623  0.5773  0.5989  0.5585  0.5973  0.6360  0.6436
        CorrAE      0.5501  0.5856  0.6344  0.6678  0.5765  0.5889  0.6045  0.6123  0.6060  0.6176  0.6389  0.6443
        CMSSH       0.5911  0.5968  0.6215  0.6613  0.5503  0.6065  0.6188  0.6232  0.5487  0.5573  0.5583  0.5614
        SCM         0.5524  0.6315  0.6606  0.6736  0.5814  0.6051  0.6189  0.6324  0.5924  0.6320  0.6410  0.6485
        QCH         0.5706  0.6586  0.6796  0.6855  0.6002  0.6128  0.6226  0.6355  0.6022  0.6427  0.6554  0.6686
        CCQ (ours)  0.5913  0.6481  0.6917  0.7069  0.6017  0.6286  0.6366  0.6422  0.6090  0.6433  0.6541  0.6550
I → T   CVH         0.4588  0.4713  0.4…    0.4740  0.1673  0.1877  …       0.1696  0.6091  0.6225  0.6364  0.6199
        IMH         0.4345  0.4399  0.4203  0.4115  0.1734  0.1896  0.1714  0.1601  0.5449  0.5646  0.5936  0.5539
        CorrAE      0.4398  0.4522  0.4699  0.4964  0.1929  0.1982  0.2033  0.2155  0.6301  0.6329  0.6357  0.6401
        CMSSH       0.3950  0.4052  0.4076  0.3516  0.1672  0.1727  0.1750  0.1759  0.5076  0.5272  0.5357  0.5219
        SCM         0.4693  0.4648  0.4619  0.4851  0.2258  0.2372  0.2381  0.2378  0.6361  0.6493  0.6495  0.6440
        QCH         0.4765  0.4895  0.5050  0.5125  0.2288  0.2343  0.2368  0.2402  0.6452  0.6523  0.6685  0.6721
        CCQ (ours)  0.5124  0.5161  0.5165  0.5372  0.2338  0.2349  0.2371  0.2374  0.6879  0.7081  0.7183  0.7176
I → IT  CCQ (ours)  0.5074  0.5217  0.5129  0.5441  0.2512  0.2513  0.215…  0.2587  0.7063  0.6894  0.6989  0.6996
T → I   CVH         0.5598  …       …       0.4875  0.2309  0.2219  0.2214  0.2350  0.5972  0.6032  0.5738  0.5794
        IMH         0.4380  0.4582  0.4186  0.4051  0.2394  0.2227  0.2333  0.1896  0.5374  0.5536  0.5513  0.5583
        CorrAE      0.4303  0.4501  0.4634  0.4880  0.2688  0.2928  0.3478  0.3566  0.6142  0.6198  0.6247  0.6431
        CMSSH       0.3783  0.3499  0.3944  0.4015  0.2926  0.2991  0.2537  0.2582  0.5868  0.5732  0.6176  0.6323
        SCM         0.4449  0.4859  0.5105  0.5259  0.3157  0.3698  0.4239  0.4369  0.6037  0.5998  0.5805  0.6078
        QCH         0.5020  0.5195  0.5489  0.5622  0.3426  0.3753  0.4411  0.4565  0.6258  0.6425  0.6485  0.6528
        CCQ (ours)  0.5359  0.5410  0.5413  0.5556  0.3885  0.4000  0.4222  0.4178  0.6548  0.7026  0.7165  0.7266
T → IT  CCQ (ours)  0.6022  0.6925  0.7…    0.7153  0.6355  0.6351  0.6…    0.6405  0.6942  0.7151  0.7190  0.7416


via composite quantization in isomorphic space can learn balanced 
binary codes, hence its performance improves with longer codes. 

It is interesting to observe that the performance of the cross-modal
search task I → T is generally better than that of the intra-modal search
task I → I, while this observation does not hold for the counterparts
T → I and T → T. This seems abnormal at first sight as
cross-modal search tasks are often more challenging than intra-modal
search tasks due to the semantic gap. However, in general,
text retrieval is much easier than image retrieval, making different
modalities of the objects contribute differently to the cross-modal
retrieval performance. We believe that T → T is much easier than
T → I, but I → T may be easier than I → I because image-to-image
retrieval is often the most difficult task. In the case of the cross-modal
task I → T, the knowledge of the text modality is transferred to
the image modality, making cross-modal retrieval easier. This shows
that cross-modal retrieval can be improved by knowledge transfer.

The precision-recall curves and the precision@top-R curves
are illustrated in Figure 2. Due to space limitations, only the results
of the cross-modal tasks I → T and T → I are presented, while similar
trends are observed on the intra-modal tasks I → I and
T → T. CCQ shows the best cross-modal retrieval performance at
all recall levels and top-R ranks. This validates that CCQ is suitable
for diverse retrieval scenarios, which may emphasize higher precision
at a smaller number of top-R retrieved results, e.g. Web search,
or higher recall while tolerating fairly lower precision, e.g. vertical search.
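
The two curve types can be computed from the same ranked relevance flags. The sketch below is a minimal illustration for a single query, assuming the total number of relevant database items is known and positive; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def precision_at_top_r(relevant: Sequence[bool], r: int) -> float:
    """Fraction of relevant items among the top r retrieved results."""
    top = relevant[:r]
    return sum(top) / len(top) if top else 0.0


def precision_recall_points(relevant: Sequence[bool],
                            total_relevant: int) -> List[Tuple[float, float]]:
    """(recall, precision) after each retrieved result of one ranked list."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    points = []
    hits = 0
    for rank, is_hit in enumerate(relevant, start=1):
        hits += int(is_hit)
        points.append((hits / total_relevant, hits / rank))
    return points


ranking = [True, False, True, True]
print([precision_at_top_r(ranking, r) for r in (1, 2, 4)])
print(precision_recall_points(ranking, total_relevant=3))
```

Averaging these per-query values over all queries gives one point per R (or per recall level) on the reported curves.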

5.4.2 Results on Wiki 

Table 2 also compares the search performance of CCQ and the
state of the art methods on the Wiki dataset, which shows that CCQ
significantly outperforms the unsupervised hashing methods CVH,
IMH, and CorrAE, and performs comparably to the supervised hashing
methods SCM and QCH. A notable observation is that the MAPs
are much smaller than those on the NUS-WIDE dataset. This is reasonable
as the images of Wiki are of low quality (low resolution)
and high diversity, i.e. the text can well describe the semantics of
the image-text pair while the image may not be well related to the
semantics of the image-text pair, which makes it more challenging
to capture the semantic correlations between image query and text
database. Note that the texts of Wiki are featured articles which are
well edited by experts and rich in semantic information, hence it is
fairly easy to correlate a text query with the multimodal database.

The precision-recall curves and the precision@top-R curves
are shown in Figure 3. Again, CCQ is among the top-performing
methods at all recall levels and all top-R ranks. A
noticeable performance drop can be observed from the precision-recall
curves to the precision@top-R curves. This is because
the Wiki dataset is very small-scale with only 2,173 database items,
hence all relevant results will be retrieved at small R and no more
relevant results can be further retrieved when R grows too large.
This highlights the importance of evaluation with different metrics.

A crucial advantage of CCQ over the comparison methods lies
in that CCQ jointly learns the isomorphic latent space and compact
binary codes by minimizing both correlation and quantization errors
in a unified optimization framework, while comparison methods
merely learn the isomorphic space and binary codes in a separated
two-step pipeline. As examined in CorrAE, the quality of
searching with binary codes using Hamming distance is evidently
inferior to searching with continuous features using Euclidean distance,
due to substantial information loss by converting continuous
features to binary codes without minimizing the quantization error.
The search quality loss due to binarization is shown in Figure 5(a),
where for CCQ we compare the continuous latent features with their
quantized binary codes. We see that IMH and CorrAE suffer from substantial
MAP loss (similar trends are observed for other methods) while
CCQ is almost lossless under binarization. In other words, by jointly
minimizing the correlation error and quantization error, CCQ can
circumvent information loss and learn more accurate binary codes.

Figure 3: Precision-recall curves (top) and precision@R curves (bottom) on Wiki cross-modal search tasks @ 16 and 32 bits.

Figure 4: Precision-recall curves (top) and precision@R curves (bottom) on Flickr1M cross-modal search tasks @ 16 and 32 bits.

5.4.3 Results on Flickr1M

In practical retrieval systems, it is crucial to process large-scale
datasets in both training and testing phases, and thus we compare
CCQ with state of the art methods on the large-scale Flickr1M dataset.
We report the MAP results in Table 2 and illustrate the detailed
precision-recall curves and precision@top-R curves in Figure 4.
As mentioned before, we randomly select 10,000 image-text pairs
as the training set to learn hash functions for the methods that are
computationally too demanding to train on the complete Flickr1M dataset.
We can observe that CCQ significantly outperforms the comparison
methods on all retrieval tasks and performs better with longer
codes. This validates the superiority of CCQ in processing large-scale
datasets, as the experimental setting on Flickr1M is consistent
with real-world system settings where a sufficiently accurate model
needs to be derived from a sufficiently large training set. We will examine
CCQ’s ability to process real semi-paired data in the sequel.

5.5 Semi-Paired Data Quantization 

Most of the existing methods, including the ones in comparison,
require full correspondences between different modalities. In other
words, the multimodal data objects are fully paired, e.g. image-text
pairs. As a result, these methods are unable to tackle more realistic
scenarios in which only a limited number of paired data points
are available. CCQ explores the idea of semi-supervised learning
and can leverage both paired data (all modalities of the objects are
available) and unpaired data (partial modalities of the objects are
available) to boost the search quality when paired data are limited.
To verify this, we consider the NUS-WIDE and Flickr1M datasets
and for each dataset, we randomly sample as the training set 1)
10,000 images without text modality, 2) 10,000 texts without image
modality, and 3) different numbers, i.e. [0.5, 1, 2, 4, 8] x 10^,
of image-text pairs. We train CCQ with these semi-paired data and
evaluate the search performance in terms of MAP @ 32 bits.

Figure 5: Effectiveness and efficiency experiments: (a) Loss of search quality in MAP (by red bars) due to conversion from continuous
features to binary codes on Wiki. (b)-(c) The MAP of CCQ w.r.t. different numbers of paired data points (the number of unpaired
data points is fixed to 10,000). Solid lines indicate training with both paired and unpaired data, and dashed lines indicate training
with only paired data. (d) Average search time (ms) for each query via lookup tables on Wiki, NUS-WIDE, Flickr25K, and Flickr1M.

Figure 6: Efficiency verification experiments: (a)-(b) Training time and memory costs of different methods on the complete Flickr1M
dataset. CCQ with batch (mini-batch) training scales linearly (constantly) with the sample size. (c)-(d) The MAP of CCQ @ 32 bits
versus parameter λ ∈ [0.1, 200] for cross-modal retrieval tasks I → T and T → I on the NUS-WIDE, Wiki, and Flickr1M datasets.
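
The semi-paired sampling protocol can be sketched as a random split of object indices. This is only an illustration under stated assumptions: the three subsets are taken to be disjoint, which the text does not specify, and the function name, the dataset size, and the pair count used in the usage lines are hypothetical.

```python
import random
from typing import Dict, List


def sample_semi_paired(num_objects: int, num_pairs: int,
                       num_unpaired: int = 10_000, seed: int = 0) -> Dict[str, List[int]]:
    """Draw disjoint index sets: paired objects, image-only objects, text-only objects."""
    needed = num_pairs + 2 * num_unpaired
    if needed > num_objects:
        raise ValueError("dataset too small for the requested split")
    rng = random.Random(seed)
    chosen = rng.sample(range(num_objects), needed)  # sampling without replacement
    return {
        "paired": chosen[:num_pairs],                                  # both modalities kept
        "image_only": chosen[num_pairs:num_pairs + num_unpaired],      # text modality dropped
        "text_only": chosen[num_pairs + num_unpaired:],                # image modality dropped
    }


# Hypothetical sizes: 50,000 objects in total and 2,000 paired objects.
split = sample_semi_paired(num_objects=50_000, num_pairs=2_000)
print({name: len(indices) for name, indices in split.items()})
```

Repeating the call with different `num_pairs` values and a fixed `num_unpaired` reproduces the shape of the experiment, where only the number of paired data points varies.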

The search performances of CCQ on NUS-WIDE and Flickr1M
are shown in Figures 5(b) and 5(c), respectively, where solid
lines indicate training with both paired and unpaired data, and dashed
lines indicate training with only paired data. We can observe that
when the number of paired data points is small, CCQ trained with
both paired and unpaired data significantly outperforms CCQ trained
with only paired data on most of the multimodal search tasks; when
the number of paired data points increases, the search performance
of CCQ will gradually saturate while the search quality of the two
training paradigms will finally match. This clearly shows that CCQ
can effectively leverage both paired and unpaired data (partial multimodal
data) to boost search quality in a semi-paired data scenario.

An unexpected phenomenon is that semi-paired training slightly
deteriorates search performance on task I → T. We conjecture that
the reason is that searching a text database with image queries
relies heavily on maximizing the image-text correlations to
bridge the semantic gap between low-level image features and high-level
image semantics, i.e. its associated texts. When the number of
paired data points is clearly smaller than the number of unpaired
data points, semi-paired training may tend to weaken correlation
learning from image-text pairs and incur performance degradation.

5.6 Search Efficiency 

To search for approximate nearest neighbors (ANN) in the database
for a given query, all methods in comparison perform a linear scan
using symmetric or asymmetric distance. Specifically, to compare
a query vector with a database vector, CVH, IMH, and CorrAE all
compute symmetric Hamming distance via lookup tables, and CCQ
constructs a distance lookup table for each query that stores the
Euclidean distances between the query and the multiple codebooks.
As a result, CVH, IMH, CorrAE, and CCQ perform exactly the
same number of table lookups for the linear scan, while their costs of
computing the query-codebook distance lookup tables are slightly
different; this difference is negligible relative to the
cost of the linear scan. For example, computing the distance
lookup table for CCQ takes less than 1% of the cost of the linear
scan on Flickr1M. The average search time of each query by CVH,
IMH, CorrAE, and CCQ on the Wiki, NUS-WIDE, Flickr25K, and
Flickr1M datasets is illustrated in Figure 5(d), from which we can
observe that the search efficiency is comparable for all methods.
While it is beyond the scope of this paper, we note that one
can adopt a Multi-Index [2] approach to achieve sub-linear search
complexity on the binary codes and further boost search efficiency.
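
The lookup-table scan can be sketched as follows. This is a simplified illustration rather than the implementation used in the experiments: it assumes each database item is stored as one codeword index per codebook, and that summing the per-codebook squared distances ranks items the same way as the exact distance. That holds in the toy setup because the two codebooks occupy disjoint coordinates; all names and values are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_lookup_table(query: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[List[float]]:
    """table[m][k] = squared distance between the query and codeword k of codebook m."""
    return [[squared_distance(query, word) for word in book] for book in codebooks]


def linear_scan(table: List[List[float]], codes: Sequence[Sequence[int]]) -> List[int]:
    """Rank database items by the sum of one table lookup per codebook."""
    scores = [sum(table[m][k] for m, k in enumerate(item)) for item in codes]
    return sorted(range(len(codes)), key=scores.__getitem__)


# Toy setup: two codebooks with two codewords each, living in disjoint coordinates.
codebooks = [
    [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 1.0, 1.0)],
]
codes = [(0, 0), (0, 1), (1, 0), (1, 1)]  # one codeword index per codebook
query = (0.9, 1.1, 0.1, 0.0)
ranking = linear_scan(build_lookup_table(query, codebooks), codes)
print(ranking)
```

The table is built once per query, after which each database item costs only as many lookups as there are codebooks; this is why the table construction is a small fraction of the total scan cost.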

5.7 Training Complexity 

The training time and memory costs of CCQ scale linearly with
the training sample size, and hence CCQ can process large-scale datasets.
To verify this, we follow prior work and use the complete Flickr1M dataset
to evaluate the consumption of training time and memory. CMSSH
and IMH are not compared in this study since they require O(N^2)
complexity and run out of either time or memory on this dataset.

The comparison of training time costs is illustrated in Figure 6(a).
We can observe that the training time of CCQ increases linearly
with respect to the sample size. Due to multiple iterations between
three sets of variables, i.e. transformation matrices R^v, quantizer
codebook C, and modality-specific binary codes B^v, CCQ is not as
efficient as CVH. However, CCQ is much more time-efficient
than CorrAE, which is a deep learning based method that solves
a time-demanding non-convex nonlinear optimization problem.

The training memory consumptions are compared in Figure 6(b).
Both batch and mini-batch (large-scale) implementations of CCQ
store the model parameters in memory, whose size is independent of the
training dataset size. For the batch implementation, all training data
is loaded in memory, while for the mini-batch implementation, the
training data is partitioned into multiple mini-batches and only
one mini-batch is loaded in memory at a time. Hence in the mini-batch
(large-scale) implementation, the memory cost stays constant
when the training dataset size increases. We can flexibly allocate memory
to each mini-batch to trade off memory and disk reading costs.
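
The constant-memory behaviour of mini-batch processing can be illustrated with a generic streaming pass. The sketch below is not the CCQ training procedure; it only accumulates a feature mean one mini-batch at a time, and `load_batch` is a hypothetical callable standing in for a disk read.

```python
from typing import Callable, Iterator, List

Batch = List[List[float]]


def batch_ranges(num_samples: int, batch_size: int) -> Iterator[range]:
    """Partition sample indices into consecutive mini-batches."""
    for start in range(0, num_samples, batch_size):
        yield range(start, min(start + batch_size, num_samples))


def streaming_mean(load_batch: Callable[[range], Batch],
                   num_samples: int, batch_size: int, dim: int) -> List[float]:
    """Accumulate a feature mean while holding only one mini-batch in memory."""
    totals = [0.0] * dim
    for index_range in batch_ranges(num_samples, batch_size):
        batch = load_batch(index_range)  # only this mini-batch is resident
        for row in batch:
            for j, value in enumerate(row):
                totals[j] += value
    return [total / num_samples for total in totals]


def fake_loader(index_range: range) -> Batch:
    """Hypothetical stand-in for reading one mini-batch of features from disk."""
    return [[float(i), 2.0 * i] for i in index_range]


print(streaming_mean(fake_loader, num_samples=10, batch_size=3, dim=2))
```

A larger `batch_size` means fewer disk reads at the price of more resident memory, which is the trade-off described above.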

5.8 Parameter Sensitivity 

Towards unsupervised multimodal retrieval, CCQ is designed to
involve only two parameters, the dimension D of the modality-isomorphic
subspace and the modality trade-off weight λ, and the performance
is expected to be stable against parameter variations. Since we have
fixed D = min({P_v}_{v=1}^{V}, K), we only inspect the sensitivity of λ.

We compute MAP @ 32 bits on both cross-modal retrieval tasks
by varying λ between 0.1 and 200. The performance of CCQ w.r.t.
parameter λ is shown in Figures 6(c) and 6(d). We see that CCQ can
consistently outperform all the unsupervised baseline methods by a
large margin with λ varying between 1 and 200. This validates that
CCQ is robust against parameter selection and is applicable to unsupervised
multimodal retrieval with easily configured parameters.

6. CONCLUSION AND FUTURE WORK

In this paper, we have formally approached seamless multimodal 
hashing through a novel composite correlation quantization (CCQ). 
It integrates multimodal correlation and composite quantization into 
a seamless latent semantic analysis (LSA) framework, which yields 
compact binary codes that encode both intra-modal similarity and 
inter-modal correlation. The sharing of codebooks and binary codes 
across modalities enables joint learning of latent semantics that are 
maximally correlated in the isomorphic feature space, which serves 
as the key contributor to the efficacy of the proposed CCQ method. 

In the future, we plan to equip our model with a deep learning architecture
which can learn highly abstract nonlinear representations
to better distill the correlation structures across multiple modalities.